Jake Neyer
January 14, 2018
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
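A listing like the one above is what `str()` prints for a data frame. The following is a minimal sketch of the loading step, assuming the data comes from a CSV file whose first, unnamed column holds the row index. The temporary file and the reduced set of columns are hypothetical stand-ins, not the author's actual file.

```r
# Hypothetical stand-in for the real CSV: a few columns of the first four rows
csv_text <- c(
  '"","fixed.acidity","volatile.acidity","alcohol","quality"',
  '"1",7.4,0.7,9.4,5',
  '"2",7.8,0.88,9.8,5',
  '"3",7.8,0.76,9.8,5',
  '"4",11.2,0.28,9.8,6'
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# read.csv repairs the empty header name to "X", which would explain
# an index column called X
wine <- read.csv(path)
str(wine)
```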
Input variables (based on physicochemical tests):
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do
not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of
levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and
flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s
rare to find wines with less than 1 gram/liter and wines with greater than 45
grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between
molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations,
SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2
becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the
percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic)
to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2)
levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
And a more human-readable version:
| X | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 2 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 3 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 4 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 6 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
Checking out the acidity of the wines:
Barring citric acid content, the acidity seems to be roughly normally distributed.
Fixed acidity is centered around 6-8 g/dm^3 and volatile acidity is generally
between 0.4 and 0.7 g/dm^3. I will look more into this in the following section.
To normalize the volatile acidity and citric acid contents, I performed a
logarithmic transformation on the two variables.
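The report does not say which logarithm base was used or how zero values were handled, so the following is only a sketch under assumptions: base 10, applied to the first six wines from the table above. It shows why citric acid needs extra care, since several wines have a citric acid content of exactly 0.

```r
# Values copied from the first six rows of the dataset shown above
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66)
citric.acid      <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)

# Assumption: base-10 log; every volatile acidity value is > 0, so this is safe
log_volatile <- log10(volatile.acidity)

# citric acid contains exact zeros, and log10(0) is -Inf
log_citric <- log10(citric.acid)
sum(is.infinite(log_citric))  # how many wines fall off a log-scaled axis

# Assumption: log1p(), i.e. log(1 + x), is one way to keep those rows finite
log1p_citric <- log1p(citric.acid)
```

On a log-scaled histogram the zero-valued wines are dropped, which is worth keeping in mind when reading the citric acid distribution.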
Taking a look at the other factors:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar content is roughly normally distributed around 2.2 g/dm^3 (the
median), with a long right tail that reaches 15.5 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chloride content is roughly normally distributed around 0.087 g/dm^3 (the mean),
with a right tail that reaches 0.611 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
To normalize the distribution, I performed a logarithmic transformation
on the free SO2 content. For the most part it appears to be normally distributed
around 14-15 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The total SO2 content is generally around 46 mg/dm^3 (the mean; the median is
38 mg/dm^3).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphate content is normally distributed around 0.6 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The density of the wines is normally distributed around 0.997 g/cm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
I used a logarithmic transformation to try to normalize the alcohol content, but
it is still skewed significantly to the right.
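Several of the summaries above have a mean well above the median, which is a quick sign of right skew. As a rough sketch, the gap between mean and median can be expressed in units of the interquartile range. The `skew_hint` indicator is a hypothetical helper for illustration, not a measure used in this report, and the numbers are copied from the summaries above.

```r
# Quartiles, medians and means copied from the summary outputs above
stats <- data.frame(
  variable = c("residual.sugar", "total.sulfur.dioxide", "density", "alcohol"),
  q1       = c(1.900, 22.00, 0.9956,  9.50),
  median   = c(2.200, 38.00, 0.9968, 10.20),
  mean     = c(2.539, 46.47, 0.9967, 10.42),
  q3       = c(2.600, 62.00, 0.9978, 11.10)
)

# Hypothetical indicator: (mean - median) / IQR; values near 0 suggest symmetry,
# clearly positive values suggest a right tail
stats$skew_hint <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Show the variables from most to least right-skewed by this indicator
stats[order(-stats$skew_hint), c("variable", "skew_hint")]
```

By this indicator density looks close to symmetric, while residual sugar, total SO2 and alcohol all lean to the right.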
Taking a look at the quality distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality ratings of 5 and 6 are by far the most common. Ratings of 7 or above are
much less frequent, and no wine is rated below 3 or above 8.
Taking a look at the modified quality distribution (Bad, Good, and Great):
‘Bad’ wines are the most frequent (744), ‘Good’ wines are a close second (638),
and ‘Great’ wines are far fewer (217).
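The report does not state where the boundaries between Bad, Good, and Great lie. The sketch below assumes Bad means 5 or lower, Good means 6, and Great means 7 or higher, and applies `cut()` to the first ten quality ratings listed in the structure output at the top of the report.

```r
# First ten quality ratings, as listed in the structure output above
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed cut points: Bad = 5 or lower, Good = 6, Great = 7 or higher.
# Intervals are right-closed by default: (-Inf, 5], (5, 6], (6, Inf)
score <- cut(quality,
             breaks = c(-Inf, 5, 6, Inf),
             labels = c("Bad", "Good", "Great"),
             ordered_result = TRUE)

# Count how many wines fall in each level
table(score)
```

Making the factor ordered keeps the levels in Bad, Good, Great order in tables and plot legends.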
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality score
## Min. : 8.40 Min. :3.000 Bad :744
## 1st Qu.: 9.50 1st Qu.:5.000 Good :638
## Median :10.20 Median :6.000 Great:217
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The dataset contains 1599 red wines described by 13 columns: a row index (X), 11
physicochemical input variables, and the quality rating. Quality is rated on a
scale from 0 (Very Bad) to 10 (Very Excellent), although the ratings in this
dataset only range from 3 to 8. The most common quality ratings are 5 and 6. I
wonder which features play the largest role in determining the quality rating. I
am also curious to better understand how different variables affect others, such
as alcohol content and density, or acidity and residual sugar. In the following
section I will look at how the different variables interact and which of them
play a role in determining the overall quality of the wine.
Citric acid seems to be correlated with many other variables in this dataset.
Density also seems to be correlated with many other variables.
Taking a look at acidity as it relates to other variables:
## [1] 0.6717034
Citric acid has a strong, positive relationship with fixed acidity.
## [1] -0.5524957
Citric acid has a moderately strong, negative relationship with volatile acidity.
## [1] 0.6680473
Density has a strong, positive relationship with the fixed acidity.
## [1] -0.4961798
Density has a negative relationship with alcohol content. Generally speaking,
wines with a higher alcohol content have a lower density. This makes sense because
alcohol is less dense than water.
## [1] 0.4761663
Alcohol content has a moderate, positive correlation with quality (about 0.48).
This plot shows not only the frequency of each quality rating but also the
statistical breakdown: the box plots give a good indication of the alcohol content
typical of each quality rating.
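Each coefficient above is the kind of value `cor()` returns for a pair of columns. As a small sketch, the call is shown here on the first six wines from the table near the top of the report. With so few rows the result will not match the full-data coefficients, and `cor.test()` is an addition for illustration rather than something this report is known to have used.

```r
# First six wines from the table near the top of the report
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4)
citric.acid   <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)

# Pearson correlation (the default method of cor)
r <- cor(fixed.acidity, citric.acid)
r

# cor.test reports the same coefficient with a confidence interval and p-value
ct <- cor.test(fixed.acidity, citric.acid)
ct$estimate
ct$conf.int
```

On the full data frame the same call would take two columns, for example the citric acid and fixed acidity columns.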
I took a look at how the variables interact with each other. Acidity clearly
plays a large role in the wine. Both fixed and volatile acidity have strong
relationships with many of the variables. The strongest relationship, maybe not
so surprisingly, is between citric acid and the fixed acidity of the wine. Another
rather strong relationship is the one between alcohol content and the quality of
the wine.
Here is a detailed breakdown of each input variable's correlation with wine
quality:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## [1,] 0.1240516 -0.3905578 0.2263725 0.01373164 -0.1289066
## free.sulfur.dioxide total.sulfur.dioxide density pH
## [1,] -0.05065606 -0.1851003 -0.1749192 -0.05773139
## sulphates alcohol
## [1,] 0.2513971 0.4761663
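To make the breakdown above easier to read, the coefficients can be ranked by absolute size. This sketch copies the values from the output above; squaring them as a rough "share of variance" is an added illustration, not part of the original analysis.

```r
# Correlation of each input variable with quality, copied from the output above
r <- c(fixed.acidity        =  0.1240516,
       volatile.acidity     = -0.3905578,
       citric.acid          =  0.2263725,
       residual.sugar       =  0.01373164,
       chlorides            = -0.1289066,
       free.sulfur.dioxide  = -0.05065606,
       total.sulfur.dioxide = -0.1851003,
       density              = -0.1749192,
       pH                   = -0.05773139,
       sulphates            =  0.2513971,
       alcohol              =  0.4761663)

# Order by strength of the relationship, ignoring its sign
ranked <- r[order(abs(r), decreasing = TRUE)]
round(ranked, 2)

# Squared correlation: the share of variance in quality that a single
# variable accounts for in a simple linear fit
round(ranked^2, 3)
```

Alcohol ranks first, followed by volatile acidity, sulphates and citric acid. Even alcohol accounts for only about 0.23 of the variance in quality on its own.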
The quality scale is difficult to visualize. Breaking the ratings into larger
levels will help visualize the data more easily.
Running the plots again with the grouped ratings, I get this:
As mentioned before, there is a strong relationship between fixed acidity and
citric acid content. Additionally, great wines seem to have less volatile acidity
and higher fixed acidity.
Better quality wines seem to have more sulphates. Additionally, lower quality wines
are concentrated in the area of low sulphate content and low alcohol content.
pH levels do not seem to have much of an effect on wine quality, but once again
we see that sulphates do seem to play a role in the quality rating.
Lower quality wines tend to have lower alcohol content. This is an interesting
visualization because it also shows, in a different way, how many wines fall into
each quality rating.
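The trend in the plot can also be checked numerically by summarizing alcohol content within each quality rating. The sketch below does this with `tapply()` on the first ten wines listed in the structure output at the top of the report. The sample is far too small to stand in for the full result, so it only illustrates the calculation.

```r
# First ten wines, as listed in the structure output at the top of the report
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol content within each quality rating
med <- tapply(alcohol, quality, median)
med
```

The same call works with the grouped Bad, Good, Great factor in place of the numeric rating.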
Alcohol content clearly plays one of the largest roles in determining the quality
of the wines. Additional variables such as volatile acidity, fixed acidity, sulphate
content and citric acid content play a somewhat smaller role in determining
the quality.
This plot displays the distribution of wine quality in the dataset
and gives good insight into the percentage breakdown of the dataset by quality.
Wines with low sulphate content and low alcohol percentage are almost certainly
low quality wines.
This plot shows the clear stratification of wine quality based on alcohol content.
I chose this as my final plot because it highlights the fact the alcohol content
is a large determining factor in the wine’s quality.
Through my exploratory data analysis of the ‘Wine Quality’ dataset, I
discovered key variables that determine the quality of a wine. I identified alcohol
as playing one of the largest roles in determining quality. Additionally, fixed and
volatile acidity also play key roles in determining the quality. A larger dataset
would make it possible to explore the variables in more detail. Moving ahead, I
would like to fit some different models to this dataset and test how well they
predict quality.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html